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(57) Abstract: A parallel hardware-based multithreaded processor is described. The processor includes a general purpose processor 
that coordinates system functions and a plurality of microcngincs that support multiple program threads. The processor also includes 
a memory control system that has a first memory controller that sorts memory references based on whether the memory references 
are directed to an even bank or an odd bank of memory and a second memory controller that optimizes memory references based 
upon whether the memory references are read references or write references. A program thread communication scheme for packet 
processing is also described. 
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THREAD SIGNALING IN Ml JITI -THREADED NETWORK PROCESSOR 

BACKGROUND 

This invention relates to network packet processing. 

Parallel processing is an efficient form of information 
processing of concurrent events in a computing process. Parallel 
processing demands concurrent execution of many programs in a 
computer, in contrast to sequential processing. In the context 
of a parallel processor, parallelism involves doing more than 
one thing at the same time. Unlike a serial paradigm where all 
tasks are performed sequentially at a single station or a 
pipelined machine where tasks are performed at specialized 
stations, with parallel processing, a plurality of stations are 
provided with each capable of performing all tasks. That is, in 
general all or a plurality of the stations work simultaneously 
and independently on the same or common elements of a problem. 
Certain problems are suitable for solution by applying parallel 
processing. 

SUMMARY 

According to an aspect of the present invention, a 
method for processing of network packets includes receiving 
network packets and operating on the network packets with a 
plurality of program threads to affect processing of the packets. 
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BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a block diagram of a communication system 
employing a hardware -based multithreaded processor. 

FIG. 2 is a detailed block diagram of the hardware- 
based multithreaded processor of FIG. 1. 

FIG. 3 is a block diagram of a microengine functional 
unit employed in the hardware -based multithreaded processor of 
FIGS. 1 and 2. 

FIG. 4 is a block diagram of a memory controller for 
enhanced bandwidth operation used in the hardware -based 
multithreaded processor. 

FIG. 5 is a block diagram of a memory controller for 
latency limited operations used in the hardware-based 
multithreaded processor. 

FIG. 6 is a block diagram of a communication bus 
interface in the processor of FIG. 1 depicting hardware used in 

* 

program thread signaling. 

FIGS. 7A-7B are a pictorial representation and flow 
chart useful in understanding program thread signaling with a . 
clear on read register. 

FIG. 8 is a flow chart of an inter- thread signaling 
scheme. *. 
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FIG. 9 is a flow chart of a program thread status 
reporting process. 

DESCRIPTION 

Architecture : 

Referring to FIG. 1, a communication system 10 includes 
a parallel, hardware -based multithreaded processor 12. The 
hardware -based multithreaded processor 12 is coupled to a bus 
such as a Peripheral Component Interconnect (PCI) bus 14, a 
memory system 16 and a second bus 18 . The system 10 is 
especially useful for tasks that can be broken into parallel 
subtasks or functions. Specifically hardware -based multithreaded 
processor 12 is useful for tasks that are bandwidth oriented 
rather than latency oriented. The hardware -based multithreaded 
processor 12 has multiple microengines 22 each with multiple 
hardware controlled program threads that can be simultaneously 
active and independently work on a task. 

The hardware -based multithreaded processor 12 also 
includes a central controller 20 that assists in loading 
microcode control for other resources of the hardware -based 
multithreaded processor 12 and performs other general purpose 
computer type functions such as handling protocols, exceptions, 
extra support f pr packet processing where the microengines pass 
the packets off for more detailed processing such as in boundary. 
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conditions. In one embodiment , the processor 20 is a Strong 
Arm* (Arm is a trademark of ARM Limited, United Kingdom) based 
architecture. The general purpose microprocessor 20 has an 
operating system. Through the operating system the processor 20 
can call functions to operate on microengines 22a-22f . The 
processor 20 can use any supported operating system preferably a 
real time operating system. For the core processor implemented 
as a Strong Arm architecture, operating systems such as, 
Microsoft NT real-time, VXWorks and pCUS, a freeware operating 
system available over the Internet, can be used. 

The hardware -based multithreaded processor 12 also 
includes a plurality of microengines 22a-22f . Microengines 22a- 
22f each maintain a plurality of program counters in hardware and 
states associated with the program counters. Effectively, a 
corresponding plurality of sets of program threads can be 
simultaneously active on each of the microengines 22a-22f while 
only one is actually operating at any one time. 

In one embodiment, there are six microengines 22a-22f, 
each having capabilities for processing four hardware program 
threads. The six microengines 22a- 22f operate with shared 
resources including memory system 16 and bus interfaces 24 and 
28. The memory- system 16 includes a Synchronous Dynamic Random 
Access Memory (SDRAM) controller 26a and a Static Random Access 
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Memory (SRAM) controller 26b. SDRAM memory 16a and SDRAM 
controller 26a are typically used for processing large volumes of 
data, e.g., processing of network payloads from network packets. 
The SRAM controller 26b and SRAM memory 16b are used in a 
networking implementation for low latency, fast access tasks, 
e.g., accessing look-up tables, memory for the core processor 20, 
and so forth. 

Hardware context swapping enables other contexts with 
unique program counters to execute in the same microengine . 
Hardware context swapping also synchronizes completion of tasks. 
For example, two program threads could request the same shared 
resource e.g., SRAM. Each one of these separate functional 
units, e.g., the FBUS interface 28, the SRAM controller 26a, and 
the SDRAM controller 26b, when they complete a requested task 
from one of the microengine program thread contexts reports back 
a flag signaling completion of an operation. When the flag is 
received by the microengine, the microengine can determine which 
program thread to turn on. 

As a network processor, e.g., a router, the hardware- 
based multithreaded processor 12 interfaces to network devices 
such as a media access controller device e.g., a 10/l00BaseT 
Octal MAC 13a or .-a/. Gigabit Ethernet device 13 b. In general, as a 
network processor, the hardware -based multithreaded processor 12 
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can interface to any type of communication device or interface 
that receives/sends large amounts of data. The network processor 
can function as a router 10 in a networking application route 
network packets amongst devices 13a, 13b in a parallel manner. 
With the hardware -based multithreaded processor 12, each network 
packet can be independently processed. 

The processor 12 includes a bus interface 28 that 
couples the processor to the second bus 18. Bus interface 28 in 
one embodiment, couples the processor 12 to the so-called FBUS 18 
(FIFO bus) . The FBUS interface 28 is responsible for controlling 
and interfacing the processor 12 to the FBUS 18. The FBUS 18 is 
a 64 -bit wide FIFO bus, used to interface to Media Access 

» - 

Controller (MAC) devices. The processor 12 includes a second 
interface e.g./ a PCI bus interface 24 that couples other system 
components that reside on the PCI 14 bus to the processor 12 . 

The functional units are coupled to one or more 
internal buses. The internal buses are dual, 32 bit buses (i.e., 
one bus for read and one for write) . The hardware-based 
multithreaded processor 12 also is constructed such that the sum ■ 
of the bandwidths of the internal buses in the processor 12 
exceed the bandwidth of external buses coupled to the processor 
12. The.r prppe,sspr^ r 12 l includes an internal core processor bus* 32, - a.^k!a?<w ■ 
e.g., an ASB bus (Advanced System Bus) that couples the processor 
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core 20 to the memory controllers 26a, 26b and to an ASB 
translator 30 described below. The ASB bus is a subset of the so 
called AMBA bus that is used with the Strong Arm processor core. 
The processor 12 also includes a private bus 34 that couples the 

■ 

microengine units to SRAM controller 26b, ASB translator 30 and 
FBUS interface 28. A memory bus 38 couples the memory controller 
26a, 26b to the bus interfaces 24 and 28 and memory system 16 
including flashrom 16c used for boot operations and so forth. 

Referring to PIG. 2, each of the microengines 22a-22f 
includes an arbiter that examines flags to determine the 
available program threads to be operated upon. Any program 
thread from any of the microengines 22a- 2 2f can access the SDRAM 
controller 26a, SDRAM controller 26b or FBUS interface 28. The 
SDRAM controller 26a and SDRAM controller 26b each include a 
plurality of queues to store outstanding memory reference 
requests. The queues either maintain order of memory references 
or arrange memory references to optimize memory bandwidth. 

If the memory subsystem 16 is flooded with memory 
requests that are independent in nature, the processor 12 can 
perform memory reference sorting. Memory reference sorting 
reduces dead time or a bubble that occurs with accesses to SRAM. 
Memory reference sorting allows the processor 12 to organize "** " ' 
references to memory such that long strings of reads can be 
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followed by long strings of writes. 

Reference sorting helps maintain parallel hardware 
context program threads. Reference sorting allows hiding of pre- 
charges from one SDRAM bank to another. If the memory system 16b 
is organized into an odd bank and an even bank, while the 
processor is operating on the odd bank, the memory controller 26b 
can start precharging the even bank. Precharging is possible if 
memory references alternate between odd and even banks. By 
ordering memory references to alternate accesses to opposite 
banks, the processor 12 improves SDRAM bandwidth. Additionally, 
other optimizations can be used. For example, merging 
optimizations where operations that can be merged, are merged 
prior to memory access, open page optimizations where by 
examining addresses, an opened page of memory is not reopened, 
chaining which allows for special handling of contiguous memory 
references and refreshing mechanisms, can be employed. 

The FBUS interface 28 supports Transmit and Receive 
flags for each port that a MAC device supports, along with an 
Interrupt flag indicating when service is warranted. The FBUS 

■ 

interface 28 also includes a controller 28a that performs header 
processing of incoming packets from the FBUS 18. The controller 
28a extracts the packet headers- and performs a microprogrammable 
source/destination/protocol hashed lookup (used for address 
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smoothing) in SRAM. If the hash does not successfully resolve, 
the packet header is sent to the processor core 20 for additional 
processing. The FBUS interface 28 supports the following 
internal data transactions : 

FBUS unit (Shared bus SRAM) to/from microengine. 

FBUS unit (via private bus) writes from SDRAM Un?rt. 

FBUS unit (via Mbus) Reads to SDRAM. 

* 

The FBUS 18 is a standard industry bus and includes a 
data bus, e.g., 64 bits wide and sideband control for address and 
read/write control. The FBUS interface 28 provides the ability 
to input large amounts of data using a series of input and output 
FIFO's 29a-29b. From the FIFOs 29a-29b, the microengines 22a-22f 

♦ 

fetch data from or command the SDRAM controller 26a to move data 
from a receive FIFO in. which data has come from a device on bus. 
18, into the FBUS interface 28. The data can be sent through 

* * 

memory controller 26a to SDRAM memory 16a, via a direct memory 
access. Similarly, the microengines can move data from the SDRAM 
26a to interface 28, out to FBUS 18, via the FBUS interface 28. 

Data functions are distributed amongst the 
microengines. Connectivity to the SRAM 26a, SDRAM 26b and FBUS 
28 is via command requests. A command request can be a memory 
request or a FBUS request. For example, a command request can 
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move data from a register located in a microengine 22a to a 
shared resource, e.g., an SDRAM location, SRAM location, flash 
memory or some MAC address. The commands are sent out to each of 
the functional units and the shared resources. However, the 
shared resources do not need to maintain local buffering of the 
data. Rather, the shared resources access distributed data 
located inside of the microengines . This enables microengines 
22a-22f , to have local access to data rather than arbitrating for 
access on a bus and risk contention for the bus. With this 
feature, there is a 0 cycle stall for waiting for data internal 
to the microengines 22a-22f . 

The core processor 20 also can access the shared 
resources. The core processor 20 has a direct communication to 

« 

the SDRAM controller 26a to the bus interface 24 and to SRAM, 
controller 26b via bus 32. To access the microengines 22a-22f 
and transfer registers located at any of the microengines 22a- 
22f, the core processor 20 access the microengines 22a-22f via 
the ASB Translator 30 over bus 34. The ASB Translator 30 
performs an address translation between FBUS microengine transfer 
register locations and core processor addresses (i.e., ASB bus) 
so that the core processor 20 can access registers belonging to 
the^.micrpengines 22a-22c. . . ...... . . — v r . r 

Although microengines 22 can use the register set to 

-10- 
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exchange data. A scratchpad memory 27 is also provided to permit 
microengines to write data out to the memory for other 
microengines to read. The scratchpad 27 is coupled to bus 34. 

Microengines : 

Referring to FIG. 3, an exemplary one of the 
microengines 22a-22f, e.g., microengine 22f is shown. The 
microengine includes a control store 70 which, in one 
implementation, includes a RAM of here 1,024 words of 32 bits. 
The RAM stores a microprogram that is loadable by the core 
processor 20. The microengine 22f also includes controller logic 
72. The controller logic includes an instruction decoder 73 and 
program counter (PC) units 72a-72d. The four micro program 
counters 72a-72d are maintained in hardware. The microengine 22f 
also includes context event switching logic 74 . Context event 
logic 74 receives messages (e.g., SEQ_#_EVENT_RESPONSE ; 
FBIJ3VENT_RESPONSE; SRAM _EVENT_RESPONSE; SDRAM _E VENTURES PONS E ; 
and ASB _EVENT_RESPONSE) from each one of the shared resources, 
e.g., SRAM 26a, SDRAM 26b, or processor core 20, control and 
status registers, and so forth. These messages provide 

4 

inf ormation on whether a requested function has completed. Based 
on whether or not a function requested by a program thread has.,, 
completed and signaled completion, the program thread needs to 
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wait for that completion signal, and if the program thread is 
enabled to operate, then the program thread is placed on an 
available program thread list (not shown) . The microengine 22f 
can have a maximum of, e.g., 4 program threads available. 

In addition to event signals that are local to an 
executing program thread, the microengines 22 employ signaling 
states that are global. With signaling states, an executing 
program thread can broadcast a signal state to all microengines 

■ 

22. Any program thread in the microengines can branch on these 
signaling states. These signaling states can be used to 
determine availability of a resource or whether a resource is due 
for servicing. 

The context event logic 74 has arbitration for the four 
(4) program threads.. In one embodiment, the arbitration is a 
round robin mechanism. Other techniques could be used including 
priority queuing or weighted fair queuing. The microengine 22f 
also includes an execution box (EBOX) data path 76 that includes 
an arithmetic logic unit 76a and general purpose register set 
76b. The arithmetic logic unit 76a performs arithmetic and 
logical functions as well as shift functions. The registers set 
76b has a relatively large number of general purpose registers. 
In this implementation there are 64 general purpose registers in 
a first bank, Bank A and 64 in a second bank, Bank B. The 

-12- 
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general purpose registers are windowed so that they are 
relatively and absolutely addressable. 

The microengine 22f also includes a write transfer 
register stack 78 and a read transfer stack 80. These registers 
are also windowed so that they are relatively and absolutely 
addressable. Write transfer register stack 78 is where write 
data to a resource is located. Similarly, read register stack 80 
is for return data from a shared resource. Subsequent to or 
concurrent with data arrival, an event signal from the respective 
shared resource e.g., the SRAM controller 26a, SDRAM controller 
26b or core processor 20 will be provided to context event 
arbiter 74 which will then alert the program thread that the data 
is available or has been sent. Both transfer register banks 78 
and 80 are connected to the execution box (EBOX) 76 through a 
data path. In one implementation, the read transfer register has 
64 registers and the write transfer register has 64 registers . 

Each microengine 22a-22f supports multi- threaded 
execution of four contexts. One reason for this is to allow one 
program thread to start executing just after another program 
thread issues a memory reference and must wait until that 
reference completes before doing more work. This behavior is 
critical to maintaining •'-efficient 1 'hardware- execution of the 
microengines because memory latency is significant. Stated 
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differently, if only a single program thread execution was 
supported, the microengines would sit idle for a significant 
number of cycles waiting for references to return and thereby 
reduce overall computational throughput . Multi -threaded 
execution allows an microengines to hide memory latency by 
performing useful independent work across several program 
threads. Two synchronization mechanisms are supplied in order to 
allow a program thread to issue an SRAM or SDRAM reference, and 
then subsequently synchronize to the point in time when that 
reference completes . 

One mechanism is Immediate Synchronization. In 
immediate synchronization, the microengine issues the reference 
and immediately swaps out of that context . The context will be 
signaled when the corresponding reference completes. Once 
signaled, the context will be swapped back in for execution when 
a context-swap event occurs and it is its turn to run. Thus, 
from the point of view of a single context's instruction stream, 
the microword after issuing the mem reference does not get 
executed until the reference completes. 

A second mechanism is Delayed Synchronization. In 

* 

delayed synchronization, the microengine issues the reference, 
and continues to execute some other useful work independent of 
the reference. Some time later it could become necessary to 

-14- 
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synchronize the program thread's execution stream to the 
completion of the issued reference before further work is 
performed. At this point a synchronizing microword is executed 
that will either swap out the current program thread, and swap it 
back in sometime later when the reference has completed, or 
continue executing the current program thread because the 
reference has already completed. Delayed synchronization is 
implemented using two different signaling schemes: 

If the memory reference is associated with a transfer 
register, the signal from which the program thread is triggered 
is generated when the corresponding transfer register valid bit 
is set or cleared. For example, an SRAM read which deposits data 
into transfer register A would be signaled when the valid bit for 
A is set. If the memory reference is associated with the 
transfer FIFO or the receive FIFO, instead of a transfer 
register, then the signal is generated when the reference 
completes in the SDRAM controller 26a. Only one signal state per 
context is held in the microengines scheduler, thus only one 
outstanding signal can exist in this scheme. 

Referring to FIG. 4, the SDRAM memory controller 26a 
includes memory reference queues 9 0 where memory reference 
• ■ * . i,, ■i.i'f.-i:,:- .^requests arrive from the various microengines 22a-22f. r . -»The..-' v ? ■ - 

* 

memory controller 26a includes an arbiter 91 that selects the 

-15- 



WO 01/48606 PCTAJS00/42716 



next the microengine reference requests to go to any of the 
functioning units. Given that one of the microengines is 
providing a reference request, the reference request will come 
through the address and command queue 90, inside the SDRAM 
controller 26a. If the reference request has a bit set called 
the "optimized MEM bit" the incoming reference request will be 
sorted into either the even bank queue 90a or the odd bank queue 
90b. If the memory reference request does not have a memory 
optimization bit set, the default will be to go into an order 
queue 90c. The SDRAM controller 26 is a resource which is shared 
among the FBUS interface 28, the core processor 20 and the PCI 
interface 24. The SDRAM controller 26 also maintains a state 
machine for performing READ-MODI FY- Write atomic operations. The 
SDRAM controller 26 also performs byte alignment for requests of 
data from SDRAM. 

The order queue 90c maintains the order of reference 
requests from the microengines. With a series of odd and even 
banks references it may be required that a signal is returned 
only upon completion of a sequence of memory references to both 
the odd and even banks. If the microengine 22f sorts the memory 
references into odd bank and even bank references and one of the 
banks, e.g., the. even Jsank is .drained of memory references before 

■ 

the odd bank but the signal is asserted on the last even 

* 
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reference, the memory controller 26a could conceivably signal 
back to a microengine that the memory request had completed, even 
though the odd bank reference had not been serviced. This 
occurrence could cause a coherency problem. The order queue 90c 
allows a microengine to have multiple memory references 
outstanding of which only its last memory reference needs to 
signal a completion. 

The SDRAM controller 26a also includes a high priority 
queue 90d. In the high priority queue 90d, an incoming memory 
reference from one of the microengines goes directly to the high 
priority queue and is operated upon at a higher priority than 
other memory references in the other queues . All of these 
queues, the even bank queue 90a, the odd bank queue 90b, the 
order queue 90c and the high priority queue, are implemented in a 
single RAM structure that is logically segmented into four 
different windows, each window having its own head and tail 
pointer. Since filling and draining operations are only a single 
input and a single output, they can be placed into the same RAM 
structure to increase density of RAM structures. 

The SDRAM controller 26a also includes core bus 
interface logic i.e., ASB bus 92. The ASB bus interface logic 92 
interfaces the core processor 20 to. the SDRAM controller 26a. If 
there is incoming data from the core processor 20 via ASB 

■ 
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interface 92, the data can be stored into the MEM ASB device 98 
and subsequently removed from MEM ASB device 98 through the SDRAM 
interface 110 to SDRAM memory 16a. Although not shown, the same ' 
queue structure can be provided for the reads. The SDRAM 
controller 26a also includes an engine 97 to pull data from the 
microengines and PCI bus. 

Additional queues include the PCI address queue 94 and 
ASB read/write queue 96 that maintain a number of requests. The 
memory requests are sent to SDRAM interface 110 via multiplexer 
106. The multiplexer 106 is controlled by the SDRAM arbiter 91 
which detects the fullness of each of the queues and the status 
of the requests and from that decides priority based on a 
programmable value stored in a priority service control register 
100. 

Referring to FIG. 5, the memory controller 26b for the 
SRAM is shown. The memory controller 26b includes an address and 
command queue 120. The memory controller 26b is optimized based 
on the type of memory operation, i.e., a read or a write. The 
address and command queue 120 includes a high priority queue . 
120a, a read queue 120b which is the predominant memory reference 
function that an SRAM performs, and. an order queue 120c which in 
general will include all writes to SRAM. and reads. ,that are to be 
non-optimized. Although not shown, the address and command 
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queue 120 could also include a write queue. 

The SRAM controller 26b also includes core bus 
interface logic i.e., ASB bus 122. The ASB bus interface logic 
122 interfaces the core processor 20 to the SRAM controller 26b. 
The SRAM controller 26b also includes an engine 127 to pull data 

* * 

from the microengines and PCI bus. 

The memory requests are sent to SRAM interface 140 via 
multiplexer 126. The multiplexer 126 is controlled by the SRAM 
arbiter 131 which detects the fullness of each of the queues and 
the status of the requests and from that decides priority based 
on a programmable value stored in a priority service control 
register 130. Once control to the multiplexer 126 selects a 
memory reference request, the memory reference request, is sent 
to a decoder 138 where it is decoded and an address is generated. 

The SRAM Unit maintains control of the Memory Mapped 
off-chip SRAM and Expansion ROM. The SRAM controller 26b can 
address, e.g., 16 MBytes, with, e.g., 8 MBytes mapped for SRAM 
16b, and 8 MBytes reserved for special functions including: Boot 
space via flashrom 16c; and Console port access for MAC devices 
13a, 13b and access to associated (RMON) counters. The SRAM is 
used for local look-up tables and queue management functions. 

The .SRAM controller 26b supports the following 
transactions : 

■ 
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Microengine requests (via private bus) to/from SRAM. 
Core Processor (via ASB bus) to/ from SRAM. 

The address and command queue 120 also includes a Read 
Lock Fail Queue 120d. The Read Lock Fail Queue 120d is used to 
hold read memory reference requests that fail because of a lock 
existing on a portion of memory. 

■ 

Referring to FIG. 6, communication between the 
microengines 22 and the FBUS interface Logic (FBI) is shown. The 
FBUS interface 28 in a network application can performs header 
processing of incoming packets from the FBUS 18. A key function 
which the FBUS interface performs is extraction of packet 
headers, and a mi c reprogrammable source/destination/protocol 
hashed lookup in SRAM . If the hash does not successfully 
resolve, the packet header is promoted to the core processor 28 
for more sophisticated processing. 

The FBI 28 contains a Transmit FIFO 182, a Receive FIFO 
183, a HASH unit 188 and FBI control and status registers 189. 
These four units communicate with the microengines 22, via a 
time -multiplexed access to the SRAM bus 38 which is connected to 
the transfer registers 78, 80 in the microengines. That is, all 
communications to and from the microengines are via the transfer 
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registers 78, 80. .The FBUS interface 28 includes a push state 
machine 200 for pushing data into the transfer registers during 

* 

the time cycles which the SRAM is NOT using the SRAM data bus 
(part of bus 38) and a pull state machine 202 for fetching data 

■ 

from the transfer registers in the respective microengine. 

The Hashing unit includes a pair of FIFO's 188a, 188b. 
The hash unit determines that the FBI 28 received an FBI_hash 
request. The 'hash unit 188 fetches hash keys from the calling 
microengine 22 . After the keys are fetched and hashed, the 
indices are delivered back to the calling microengine 22. Up to 
three hashes are performed under a single FBI_hash request. The 
busses 34 and 38 are each unidirectional: SDRAM_push/pull_data, 
and Sbus_push/pull_data. Each of these busses require control 
signals which will provide read/write controls to the appropriate 
microengine 22 Transfer registers. 

« 

Generally, transfer registers require protection from 
the context controlling them to guarantee read correctness. In 
particular, if a write transfer register is being used by a 
thread_l to provide data to the SDRAM 16a, thread_l does not 
overwrite this register until the signal back from SDRAM 
controller 26a indicates that this register has been promoted and 

may now be re-used. Every write does not require a signal back . 

".***•, 

from the destination indicating that the function has been 
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completed, because if the program thread writes to the same 
command queue at that destination with multiple requests, the 
order of the completion is guaranteed within that command queue, 
thus only the last command requires the signaling back to the 
program thread. However, if the program thread uses multiple 
command queues (order and read) , then these command requests are 
broken into separate context tasks, so that ordering is 
maintained via context swapping. The exception case indicated 
at the beginning of this paragraph is relative to a certain class 
of operations using an unsolicited PUSH to transfer registers 
from the FBI for FBUS status information. In order to protect 
read/write determinism on the transfer registers, the FBI 
provides a special Push_j?rotect signal when these special FBI 
push operations are set up. 

Any microengine 22 that uses the FBI unsolicited push 
technique must test the protection flag prior to accessing the 
FBUS interf ace/microengine agreed upon transfer registers. If 
the flag is not asserted, then the transfer registers may be 
accessed by the microengines 22. If the flag is asserted then 

» 

the context should wait N cycles prior to accessing the 
registers. This count is determined a priori by the number of 
transfer registers being pushed, plus a frontend protection 
window. The microengine tests this flag then moves the data from 
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the read transfer registers to GPR's in contiguous cycles, so the 
push engine does not collide with the microengine read. 

Thread signaling for packet processing 

Special techniques such as inter- thread communications 
to communicate status, a self destruct register 210 to allow 
program threads to self assign tasks and a thread_done register 
212 to provide a global program thread communication scheme are 
used for packet processing. The destruct register 210 and a 
thread_done register 212 can be implemented as control and status 
registers 189. They are shown in the FBUS interface 28 outside 
of. the block labeled CSR for clarity. Network functions are 
implemented in the network processor using a plurality of program 
threads e.g., contexts to process network packets. For example, 
scheduler program threads could be executed in one of the 
microprogram engines e.g., 22a whereas, processing program 
threads could execute in the remaining engines e.g., 22b-22f. 
The program threads (processing or scheduling program threads) 
use inter-thread communications to communicate status. 

Program threads. are assigned specific tasks such as 
receive and transmit scheduling, receive processing, and transmit 
.... processing, etc. Task assignment and task completion are 

communicated between program threads through the inter-thread 

-23- 



WOU1/486U6 



PCT/USOU/42716 



signaling, registers with specialized read and write • 
characteristics, e.g., the self-destruct register 210 and the 
thread-done register 212, SRAM 16b and data stored in the 
internal scratchpad memory 186 (FIG. 6) resulting from operations 
such as bit set, and bit clear. 

The network processor 10 includes a general context 
communication signaling protocol that allows any context to set a 
signal that any other context can detect. This allows 
cooperating program threads to use a semaphore and thus 
coordinate using micro-code controlled processing. 

Processing of network packets can use multiple program 
threads. Typically, for network processing there is a receive 
scheduler, a transmit scheduler and processing program threads. 
A scheduler (either receive or transmit) program thread 
coordinates amounts of work to be done and sequence of work by 
processing program threads. The scheduler program thread assigns 
tasks to processing program threads and in some cases processing 
program threads can assign tasks to other processing program 
threads. For instance, a scheduler determines which ports need 
service and assigns and coordinates tasks to processing program 
threads to overcome inherent memory latency by processing 

multiple .program v threads in. parallel . 

* 

In some examples, with slow ports one processing 
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program thread may perform processing on a portion of a packet 
and a second processing program thread processes the remainder of 
the packet or in some cases the scheduler uses the next, available 
program thread. With faster ports e.g., Gigabit ports where 64 
byte packets are received very fast, the scheduler can assign M 
packets to the next available program thread. The program 
threads signal each other as to what part of a packet the program 
thread has processed and its status. 

A program thread can be assigned to process the first 
64 bytes of a packet. When the program thread is finish, the 
program thread has data to set signals to wake up the next 
program thread has been assigned to process the next 64 bytes. 
The program thread may write a register and an address of the 
register in a pre-assigned memory location, e.g., scratch 
register. The program thread sets signals to wake up the next 
program thread that has been assigned to work on the next bytes 
of the packet. 

Referring to FIGS. 7A-7B, the self-destruct register 
210 allows one scheduler program thread S, (230 in FIG. 7B) to 
request services from multiple program threads P a -P n that provide 
the requested service. The first program thread, e.g., Pi that 
accesses (232 in FIG. 7B) the self_destruct register 210 takes 
the request. The "self-destruct register" 210 zeros, i.e., 
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clears (234 in FIG. 7B) upon being read by a program thread. 
Other program threads capable of servicing that request will no 
longer be presented with an active request. For example, a 
program context can request a task be assigned to the first 
context that is ready, by writing to the self-destruct register 
210. A context checks for an assignment by reading the "self- 
destruct register" 210. If the value of the self destruct 
register is 0, there is no new task presently available to assign 
to the program thread. This could indicate that there are no new 
tasks or that another program thread may have assigned itself to 
the task and cleared the self destruct register 210. If the 
value is non-zero, the contents of the self destruct register are 
interpreted to determine the task, and the register is cleared 
upon reading by the context. Thus, contexts reading this 
register for assignment wait for the register to be written 
subsequently with the next task instruction. 

Referring to FIG. 8, for networking applications 
typically different program contexts are used to perform specific 
system tasks. Tasks include receive scheduling, receive 
processing contexts, transmit arbiter, transmit scheduling, 

* 

transmit filling and processor core communications. 

.The receive scheduler initiates 242 a receive operation 
of e.g., 64 or 128 bytes of input data by sending a command to 
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the FBI interface 28 that specifies a port from which to extract 
the data and the Receive FIFO element to use to buffer that data 
as well as the microengine context to be notified once the 
receive data has been fetched. 

The receive scheduler thread 244 sends a signal to the 
specified microengine program thread that activates a specified 
context. The context reads the FBI Receive Control register to 
obtain the necessary receive information for processing (i.e. 
port, Receive FIFO element location, byte count, start of packet, 
end of packet, error status) . If a start of packet is indicated 
the receive scheduler program thread is responsible for 
determining* where in SDRAM to store the data, (i.e., the output 
queue to insert the packet) and writing the packet data into 
SDRAM. If it is not the start of a packet, then the receive 
program thread determines where the earlier data of this packet 
was stored in order to continue processing 246 of the packet. 
When the end of packet indication is received 248 (or after the 
first 64 byte section if receive to transmit latency is 
optimized) the receive program thread adds the packet to the 
queue determined by processing the packet header. 

The program threads also communicate with a shared 
resource through a bit set and bit clear mechanism that provides 

* 

■ 

a bit vector. This mechanism allows setting and clearing of 
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individual bits and performing a test and set on individual bits 
to control a shared resource. The bit vector signals the non- 
emptiness of output queues. When a receive program thread 
enqueues a packet, the receive scheduler sets 250 a bit. The 
transmit scheduler can examine the bit vector to determine the 
state all queues. 

The bit set and bit clear operations on the bit vector, 
can occur in either scratchpad RAM or SRAM. If the scheduler is 
communicating between program threads on the same micrbengine 22, 

> 

the bit vector can be stored in the register set because each 
context can read the other context's registers. For example, an 
empty or not empty status of each output queue is support by a 
• bit vector in internal scratchpad memory. When a receive program 
thread enqueues a packet, the receive program thread uses the 
scratch pad bit -set command to set a bit in the queue status bit 
vector to indicate the queue now has at least one entry. The 
transmit arbiter scans 270 the queue bit vector for non empty 
queues (e.g., bit x set) to determine packets that are ready to be 
transmitted. When removing 272 a packet from a queue for 
transmit if the queue empties 274, the transmit arbiter issues 
276 a bit -clear command to the corresponding bit of the queue bit 
..vector. 

Referring to FIG. 9, the thread_done register is also 

■ 
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on the FBI 28 and is a register where bits can be set from 
different program threads. Each program thread can use, e.g., 
two bits to communicate its status to all other program threads. 
Also one scheduler program thread can read 292 the status of all 
of its processing program threads. Upon completion of a receive 
task, 282 a "receive" program thread writes 284 a completion code 
into the "thread_done" register. The receive program thread 
becomes inactive 286 after writing the thread_done register. 
That receive program thread waits for another signal from the FBI 
that indicates another receive task has been assigned. Program 
threads 1-16 have 2 bit fields for "thread_done_l" , and program 
threads 17-24 have 2 bit fields for "thread_done_2" . The 2 bit 
field allows a program thread to communicate different levels of 
task completion. 

For example, the scheduler can use the two bit status 
"01" to indicate that data was moved to SDRAM, processing of 
packet is still in progress and pointers were saved; bits 10 can 
indicate that data was moved to SDRAM, processing of packet is 
still in progress and pointers were not saved; and bits 11 can 
indicates packet processing is completed. Thus, the states 296a 
can be used by the receiver scheduler program thread to assign 

* 

297a another thread to process a task when data becomes, 
available, whereas, states 296b can be used by the receive 

* 
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scheduler to assign 297b the same thread to continue processing 
when the data is available. 

The exact interpretation of the message can be fixed by 
a software convention determined between a scheduler program 
thread and processing program threads called by the scheduler 
program thread. That is the status messages can change depending 
on whether the convention is for receive, as above, transmit, and 
so forth. In general, the status messages include "busy", "not 
busy", "not busy but waiting." The status message of "not busy, 
but waiting" signals that the current program thread has 
completed processing of a portion of a packet and is expected to 
be assigned to perform a subsequent task on the packet when data 
is made available. It can be used when the program thread is 
expecting data from a port and has not saved context so it should 
process the rest of that packet. 

The scheduler program thread reads the "thread done" 
register to determine the completion status of tasks it assigned 
to other program threads. The "thread done" register is 
implemented as a write one to clear register, allowing the 
scheduler to clear just the fields it has recognized. 

Other Embodiments 
It is to be understood that while the invention has 
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been described in conjunction with the detailed description 
thereof, the foregoing description is intended to illustrate and 
not limit the scope of the invention, which is defined by the 
scope of the appended claims. Other aspects, advantages, and 
modifications are within the scope of the following claims. 
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What is claimed is: 

1. A method for network packet processing comprises: 
receiving network packets; and 

operating on the network packets with a plurality of program 
threads to affect processing of the packets. 

2. The method of claim 1 wherein operating comprises: 
using at least one program thread to inspect a header 

portion of the packet, 

3. The method of claim 1 wherein operating further 

■ 

comprises: 

signaling by the at least one program thread that a packet 
header has been processed. 

4 . The method of claim 1 wherein the plurality of program 
threads are scheduler program threads to schedule task orders for 
processing and processing program threads that process packets in 
accordance with task assignments assigned by the scheduler 
program threads. 

5. The method of claim 1 wherein each program thread 
writes a message to a register that indicates its current status. 
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6. The method of claim 5 wherein interpretation of the 
message is fixed by a software convention determined between a 
scheduler program thread and processing program threads called by 
the scheduler program thread. 

* 

7. The method of claim 5 wherein status messages include 
busy, not busy, not busy but waiting. 

8. The method of claim 5 wherein a status message includes 

■ 

not busy, but waiting and wherein the status of not busy, but 
waiting signals that the current program thread has completed 
processing of a portion of a packet and is expected to be 
assigned to perform a subsequent task on the packet when data is 
made available to continue processing of the program thread. 

■ 

9. The method of claim 5 wherein the register is a 
globally accessible register that can be read from or written to 
by all current program threads. 

10. The method of claim 4 wherein scheduler program threads 

■ ■-. ;canv schedule^, any one of a plurality of processing program threads' ■ ck^^y-j^*?.. * ,;v 
to handle processing of a task. 
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11. The method of claim 10 wherein the scheduler program 

■ 

thread writes a register with an address corresponding to a 
location of data for the plurality of processing program threads. 



12. The method of claim 11 wherein a selected one of the 
plurality of processing program threads that can handle the task 
reads the register to obtain the location of the data. 



13. The method of claim 12 wherein the selected one of the 
plurality of processing program threads reads the register to 
obtain the location of the data and to assign itself to 
processing the task requested by the scheduler program thread. 



14. The method of claim 12 wherein the selected one of the 
plurality of processing tasks reads the register to obtain the 

■ 

location of the data, while the register is cleared by reading 
the register by the program thread to assign itself to process 
the task. 



15. The method of claim 13 wherein when another one of the 
plurality of processing pirqgxam thready assignable to the task 
attempts to read the register after it has been cleared, it is 
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provided with a null value that indicates that there is no task 
currently assignable to the processing program thread. 

16. A parallel hardware-based multithreaded processor for 
receiving network packets comprises: 

a general purpose processor that coordinates system 
functions; and 

a plurality of microengines that support multiple program 
threads , and operate on the network packets with a plurality of 
program threads to affect processing of the packets. 

17. The processor of claim 16 wherein one of the plurality 
of microengines executes scheduler program threads and remaining 
ones of the microengines execute processing program threads. 

18. The processor of claim 16 further comprising a global 
thread status register wherein each program thread writes a 
message to the global status register that indicates its current 
status. 

» 

19. The processor of claim ,-18 ?t wherein .interpretation of the 
message is fixed by a software convention determined between a 
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scheduler program thread and processing program threads called by 
the scheduler program thread. 

20. The processor of claim 16 further comprising: 

a read once register, wherein the scheduler program thread 
writes the read once register with an address corresponding to a 
location of data for the plurality of processing program threads 
and when a selected one of the plurality of processing program 
threads reads the register to obtain the location of the data, 
assigns itself to processing the task requested by the scheduler 
program thread, while the register is cleared by reading the 
register by the program thread. 

21. The processor of claim 20 wherein when another one of 
the plurality of processing program threads assignable to the 

* 

task attempts to read the read once register after it has been 
cleared, it is provided with a null value that indicates that 
there is no task currently assignable to the processing program 
thread. 

22. An apparatus comprising a machine-readable storage 
medium having executable. instructions. for network processing, the 
instructions enabling the apparatus to: 
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receive network packets; and 

operate on the network packets with a plurality of program 
threads to affect processing of the packets. 

23. The apparatus of claim 22 wherein instructions to 
operate further comprise instructions to: 

use at least one program thread to inspect a header portion 
of the packet. 

24. The apparatus of claim 22 further comprising 
instructions to provide scheduler program threads to schedule 
task orders for processing and processing program threads to 
process packets in accordance with task assignments assigned^ by 
the scheduler program threads. 

25. The apparatus of claim 22 wherein each program thread 
writes a message to a register that indicates its current status. 

26. The apparatus of claim 25 wherein the register is a 
globally accessible register that can be read from or written to 
by all current program threads . 

27. The apparatus of claim 22 wherein the scheduler program 
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thread writes a register with an address corresponding to a 
location of data for the plurality of processing program threads 
and a selected one of the plurality of processing program threads 
that can handle the task reads the register to obtain the 
location of the data, and clears the register after reading by 
the program thread. 



28. The apparatus of claim 27 wherein when another one of 
the plurality of processing program threads assignable to the 
task attempts to read the register after it has been cleared, it 
is provided with a null value that indicates that there is no 
task currently assignable to the processing program thread. 
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READ LOCK MISS CONTROL IN A MULTITHREADED ENVIRONMENT 

BACKGROUND OF THE INVENTION 

This ' invention relates to read lock memory references. 
When a computer instruction (thread) needs access to a 
5 memory location in the computer, it goes to a memory 

controller. A memory reference instruction may request a 
read lock on a particular memory location. The read lock 
prevents other instructions from accessing that memory 
location until a read unlock instruction for that memory 
10 location gets granted. 



SUMMARY OF THE INVENTION 

According to one aspect of the invention, a method is 
described of managing memory access to random access memory 
includes fetching a read lock memory reference request and 
15 placing the read lock memory reference request at the end 
of a read lock miss queue if the read lock memory reference 
request is requesting access to an unlocked memory location 
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and the read lock miss queue contains at least one read 
lock memory reference request. 

Other advantages will become apparent from the 
following description and from the claims, 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a communication system 
employing a hardware-based multithreaded processor. 

FIG. 2 is a block diagram of a memory controller for 
latency limited operations used in the hardware-based 
multithreaded processor. 

FIG. 3 is a flowchart of the operation of a memory 
controller in the hardware-based multithreaded processor. 

FIG. 4 is a flowchart of the operation of a memory 
controller in the hardware-based multithreaded processor. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Referring to FIG. 1, a communication system 10 
includes a parallel, hardware-based multithreaded processor 
12. The hardware-based multithreaded processor 12 is 
coupled to a bus such as a peripheral component 
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interconnect (PCI) bis 14, a memory system 16, and a second 
bus 18. The system 10 is especially useful for tasks that 
can be broken into parallel subtasks or functions. 
Specifically, a hardware-based multithreaded processor 12 
5 is useful for tasks that are bandwidth oriented rather than 
latency oriented. The hardware-based multithreaded 
processor 12 has multiple microengines 22a-22f, each with 
multiple hardware controlled threads that can be 
simultaneously active and independently work on a task. 

10 The hardware-based multithreaded processor 12 also 

includes a central controller 20 that assists in loading 
microcode control for other resources of the hardware-based 
multithreaded processor 12 and performs other general 
purpose computer type functions such as handling protocols, 

15 exceptions, and extra support for packet processing where 
the microengines 22a-22f pass the packets off for more 
detailed processing such as in boundary conditions. In one 
embodiment, the processor 20 is a Strong Arm® (Arm is a 
trademark of ARM Limited, United Kingdom) based 

20 architecture. The general purpose microprocessor 20 has an 
operating system. Through the operating system the 
processor 20 can call functions to operate on microengines 
22a-22f . The processor 20 can use any supported operating 
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system, preferably a real time operating system. For the 
core processor implemented as a Strong Arm architecture, 
operating systems such as, MicrosoftNT real-time, VXWorks 
and pCUS, a freeware operating system available over the 
Internet, can be used. 

The hardware-based multithreaded processor 12 also 
includes a plurality of function microengines 22a-22f . 
Functional microengines (microengines) 22a-22f each 
maintain a plurality of program counters in hardware and 
states associated with the program counters. Effectively, 
a corresponding plurality of sets of threads can be 
simultaneously active on each of the microengines 22a-22f 
while only one is actually operating at any one time. 

In one embodiment, there are six microengines 22a-22f 
as shown. Each microengines 22a-22f has capabilities for 
processing four hardware threads. The six microengines 
22a-22f operate with shared resources including memory 
system 16 and bus interfaces 24 and 28. The memory system 
16 includes a Synchronous Dynamic Random Access Memory 
(SDRAM) controller 26a and a Static Random Access Memory 
(SRAM) controller 26b. SDRAM 16a and SDRAM controller 26a 
are typically used for processing large volumes of data, 
e.g., processing , of • network payloads from network packets. 
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SRAM 16b and SRAM controller 26b are used in a networking 
implementation for low latency, fast access tasks, e.g., 
accessing look-up tables, memory for the core processor 20, 
and so forth. 

5 The six microengines 22a-22f access either the SDRAM 

16a or SRAM 16b based on characteristics of the data. 
Thus, low latency, low bandwidth data is stored in and 
fetched from SRAM 16b, whereas higher bandwidth data for 
which latency is not as important, is stored in and fetched 

10 from SDRAM 16a. The microengines 22a-22f can execute 
memory reference instructions to either the SDRAM 
controller 26a or the SRAM controller 26b. 

Advantages of hardware multithreading can be explained 
by SRAM or SDRAM memory accesses. As an example, an SRAM 

15 access requested by a Thread_0, from a microengine 22a-22f 
will cause the SRAM controller 26b to initiate an access to 
the SRAM 16b. The SRAM controller 26b controls arbitration 
for the SRAM bus 27, accesses the SRAM 16b, fetches the 
data from the SRAM 16b, and returns data to the requesting 

20 microengine 22a-22f. During an SRAM 16b access, if the 

microengine, e.g., 22a, had only a single thread that could 
operate, that microengine would be dormant until data was 
returned from the- SRAM- 16b. 1 "The hardware context swapping 
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within each of the microengines 22a-22f enables other 
contexts with unique program counters to execute in that 
sane microengine. Thus, another thread, e.g., Thread_l, 
can function while the first thread, e.g., Thread_0, is 
5 awaiting the read data to return. During execution, 

Thread_l may access the SDRAM memory 16a. While Thread_l 
operates on the SDRAM unit 26a, and Thread__0 is operating 
on the SRAM unit 26b, a new thread, e.g., Thread_2, can now 
operate in the microengine 22a. Thread_2 can operate for a 

10 certain amount of time until it needs to access memory or 
perform some other long latency operation, such as making 
an access to a bus interface. Therefore, simultaneously, 
the processor 12 can have a bus operation, SRAM operation, 
and SDRAM operation all being completed or operated upon by 

15 one microengine 22a and have one more thread available to 
process more work in the data path. 

The hardware context swapping also synchronizes 
completion of tasks. For example, two threads could hit 
the same shared resource, e.g., SRAM 16b. Each one of 

20 these separate functional units, e.g., the FBUS interface 
28, the SRAM controller 26b, and the SDRAM controller 26a, 
when they complete a requested task' from one of the 
microengine thread con te:.ct,S.. rcp.orts ..back a flag signaling 
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completion of an operation. When the flag is received by 
the microengine 22a-22f, the microengine 22a-22f can 
determine which thread to turn on. 

One example of an application for the hardware-based 
5 multithreaded processor 12 is as a network processor. As a 
network processor, the hardware-based multithreaded 
processor 12 interfaces to network devices such as a media 
access controller device, e.g., a 10/100BaseT Octal MAC 13a 
or a Gigabit Ethernet device 13b. In general, as a network 

10 processor, the hardware-based multithreaded processor 12 
can interface to any type of communication device or 
interface that receives/sends large amounts of data. If 
communication system 10 functions in a networking 
application, it could receive a plurality of network 

15 packets from the devices 13a, 13b and process those packets 
in a parallel manner. With the hardware-based 
multithreaded processor 12, each network packet can be 
independently processed. 

Another example for use of processor 12 is a print 

20 engine for a postscript processor or as a processor for a 
storage subsystem, i.e., RAID disk storage. A further use 
is as a matching engine. In the securities industry for 
example, the advent of electronic-trading:. requires the use 

- 7 - 



WO 01/48596 PCT/US00/33395 

of electronic matching engines to match orders between 
buyers and sellers. These and other parallel types of 
tasks can be accomplished on the system 10. 

Each of the functional units, e.g., the FBUS interface 
5 28, the SRAM controller 26b, and the SDRAM controller 26a, 
are coupled to one or more internal buses. The internal 
buses are dual, 32-bit buses (i.e., one bus for read and 
one for write) . The hardware-based multithreaded processor 
12 also is constructed such that the sum of the bandwidths 

10 of the internal buses in the processor 12 exceeds the 

bandwidth of external buses coupled to the processor 12. 
The processor 12 includes an internal core processor bus 
32, e.g., an ASB bus (Advanced System Bus), that couples 
the processor core 20 to the memory controller 26a, 26b and 

15 to an ASB translator 30. The ASB bus 32 is a subset of the 
so-called AMBA bus that is used with the Strong Arm 
processor core. The processor 12 also includes a private 
bus 34 that couples the microengine units to SRAM 
controller 26b, ASB translator 30, and FBUS interface 28. 

20 A memory bus 38 couples the memory controllers 26a, 26b to 
the bus interfaces 24 and 28 and memory system 16 including 
a flashrom 16c used for boot operations and so forth. 

- 8 - 
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Referring to FIG. 2, the SRAM controller 26b for the 
SRAM 16b is shown. The SRAM controller 26b includes an 
address and command queue 120. SRAM controller 26b is 
optimized based on the type of memory operation, i.e., a 
read or a write. The address and command queue 120 
includes a high priority queue 120a (holding memory 
reference requests from high priority tasks), a read queue 
120b (holding read memory reference requests, which are the 
predominant memory reference functions that an SRAM 
performs), and an order queue 120c (holding, in general, 
all writes to SRAM 16b and reads that are to be non- 
optimized) . Although not shown, the address and command 
queue 120 could also include a write queue. An insert 
queue control unit 132 of control logic determines where to 
queue each memory request from the microengines 22a-22f . 
An arbitration scheme in a remove queue arbitration unit 
124 of control logic determines the processing order of the 
memory reference requests in the queues 120. 

The SRAM controller 26b also includes core bus 
interface logic, e.g., ASB bus 122. The ASB bus interface 
logic 122 interfaces the core processor 20 to the SRAM 
controller 26b. The ASB bus 122 is a bus that includes a 
32-bit data path and a 28-bit address path. ^The data is 
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accessed to and from memory 16b through an MEM 
(microelectromechanical ) ASB data device 128, e.g., a 
buffer. The MEM ASB data device 128 is a queue for write 
data. If there is incoming data from the core processor 20 
via ASB interface 122, the data can be stored in the MEM 
ASB data device 128 and subsequently removed from the MEM 
ASB data device 128 through an SRAM interface 140 to SRAM 
memory 16b. Although not shown, the same queue structure 
can be provided for reads. 

The memory requests are sent to SRAM interface 140 via 
a multiplexer 126. The multiplexer 126 is controlled by an 
SRAM arbiter 131 which detects the fullness of each of the 
queues and the status of the requests and from that decides 
priority based on a programmable value stored in a priority 
service control register 130. Once control to the 
multiplexer 126 selects a memory reference request, the 
memory reference request is sent to a command decoder and 
address generator 138 where it is decoded and an address is 
generated. 

The SRAM controller 26b maintains control of the 
Memory Mapped off -chip SRAM and Expansion ROM. The SRAM 
controller 26b can address, e.g., 16 MBytes, with, e.g., 8 
MBytes mapped for SRAM 16b, and 8 MBytes reserved for . . 
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special functions including: Boot space via flashrom 16c; 
and Console port access for MAC devices 13a, 13b and access 
to associated (RMON) counters. The SRAM is used for local 
look-up tables and queue management functions. 
5 The SRAM controller 26b supports the following 

transactions : 

Microengine requests (via private bus) to/from SRAM. 
Core Processor (via ASB bus) to/from SRAM. 

10 

The SRAM controller 26b performs memory reference 
sorting to minimize delays (bubbles) in the pipeline from 
the SRAM interface 140 to memory 16b. The SRAM controller 
26b does memory reference sorting based on the read 
15 function. A bubble can either be one or two cycles 
depending on the type of memory device employed. 

The SRAM controller 26b includes a lock look-up device 
142/ an eight-entry address content addressable memory 
(CAM) for look-ups of read locks. Each position includes a 
20 valid bit that is examined by subsequent read lock 

requests. The address and command queue 120 also includes 
a read lock miss (fail) queue 120d. The read lock miss 
queue 120d is used to hold up to twenty-four read requests 
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that fail because of a lock existing on the portion of 
memory requested to be read. One of the microengines 22a- 
22f issues a read lock memory reference request that is 
processed in address and control queue 120. 
5 Referring to FIG. 3, a memory reference request is 

processed 51 by the controller 26b. The controller 26b 
receives 50 a memory reference request on either the order 
queue 120c or the read queue 120b. The command decoder and 
address generator 138 determines 52 whether the memory 

10 reference request is a "read" request, a write request, a 
"read lock" request, or a write unlock ("unlock") request. 
If the command decoder and address generator 138 determines 
52 that the request is a read request, then the command 
decoder and address generator 138 enters 54 the lock into 

15 the lock look-up device 142 so that subsequent read lock 
requests will find the memory location locked. Lock 
requests cannot be processed on that locked memory location 
until an unlock request from a microengine 22a-22f unlocks 
58 that memory location. 

20 If the command decoder and address generator 138 

determines 52 that the memory reference request is a read 
lock request, the controller 26b will access 56 lock look- 
. up. device 142 to determine whether this memory location is. 
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already locked. The access 56 determines whether the read 
lock request gets performed 62 or enters 64 the read lock 
miss queue 120d where it waits for the memory location to 
be unlocked. 

If the memory location is locked from any prior lock 
request, the read lock request fails and gets stored 64 at 
the end of the read lock miss queue 120d. The controller 
26b proceeds to receive 50 the next request. 

If the memory location is unlocked, the lock look-up 
device 142 checks 60 to see if the read lock miss queue 
120d is empty. If the read lock miss queue 120d is empty, 
that indicates that there are no previous read lock 
requests awaiting processing. Thus, the read lock request 
will be processed: the address of the memory reference will 
be used by the SRAM interface 140 to perform 62 an SRAM 
address read/write request to memory 16b and subsequently 
lock the memory location. Then the controller 26b receives 
50 the next request. 

If the read lock miss queue 120d is not empty, that 
indicates that there is at least one prior read lock 
request awaiting processing. To ensure a first-in/first- 

» 

out order of processing the read lock requests, the read 
i lock -request will be stored 64 at the end of the read lock 
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miss queue 120d. Read lock requests continue queuing in 
the read lock miss queue 120d until the controller 26b 
receives 50 an unlock request and recognizes 52 it as such. 
A memory location is unlocked by operation of a 
5 microcontrol instruction in a program after the need for 
the lock has ended. Once the unlock request is received 
50, the memory location is unlocked 58 by clearing the 
valid bit in the CAM 142. After the unlock 58, the read 
lock fail queue 120d becomes the highest priority queue 

10 120, giving all queued read lock misses a chance to issue a 
memory lock request. 

Referring to FIG. 4, the read lock miss queue 120d is 
tested 69 to see if the head of the read lock miss queue 
120d may be granted its lock. The testing 69 will continue 

15 until the read lock miss queue 120d fails to provide a 

grant or until there are no more entries in the read lock 
miss queue 120d. The highest priority is given to the read 
lock miss queue 120d by letting it win 70 the next 
arbitration in the remove queue arbitration unit 124 . The 

20 lock look-up device 142 checks 72 to see if the top read 
lock request (the one stored first in the read lock miss 
queue 120d) is requesting to read an unlocked memory 
location*. ; If the read lock request is for a locked 
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location, the read lock miss queue 120d remains as it is to 
ensure the f irst-in/ l:irst-out order described above, and 
the testing 69 ends 78. If the read lock request is for an 
unlocked location, the address of the memory reference will 
be used by the SRAM interface 140 to perform 74 an SRAM 
address read/write request to memory 16b. After performing 
74, the lock look-up device 142 checks 76 to see if the 
read lock miss queue 120d is empty. If it is empty, the 
testing 69 ends 78. Otherwise, the testing 69 continues 
with the read lock miss queue 120d winning 70 another 
arbitration in the remove queue arbitration unit 124. 

Other embodiments are within the scope of the 
following claims. 
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What is claimed is: 

1. A method of managing memory access to random 
access memory comprises: 

fetching a read lock memory reference request; and 
5 placing the read lock memory reference request at the 

end of a read lock miss queue if the read lock memory 
reference request is requesting access to an unlocked 
memory location and the read lock miss queue contains at 
least one read lock memory reference request. 

10 2. The method of claim 1 in which the random access 

memory is located in a parallel, hardware-based 
multithreaded processor. 

3. The method of claim 1 in which the random access 
memory comprises a static random access memory. 

15 4. The method of claim 1 further comprises placing 

the read lock memory reference at the end of a read lock 
miss queue if the read lock memory reference request is 
requesting access to a locked memory location. 

5. The method of claim 1 further comprises removing 
20 the read lock memory reference requests from the read lock 
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miss queue in the order in which they were placed in the 
read lock miss queue. 

6. The method of claim 1 in which the read lock 
memory reference request is fetched from a read queue. 

5 7. The method of claim 1 in which the read lock 

memory reference request is fetched from an order queue. 

8. The method of claim 1 in which the read lock miss 
queue contains a number of entries equal to a number of 
contexts that may execute the read lock memory reference 

10 requests minus one. 

9. The method of claim 1 further comprises placing a 
read lock memory reference request in a queue that contains 
a read lock memory reference request requesting access to 
the same memory location. 

15 10. An article comprising a computer-readable medium 

which stores computer-executable instructions for managing 
memory access to random access memory, the instructions 
causing a computer to: 

fetch a read lock memory reference request; and 

20 place the read lock memory reference request at the 
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end of a read lock miss queue if the read lock memory 
reference request is requesting access to an unlocked 
memory location and the read lock miss queue contains at 
least one read lock memory reference request. 

11. The article of claim 10 in which the random 
access memory is located in a parallel, hardware-based 
multithreaded processor. 

12. The article of claim 10 in which the random 
access memory comprises a static random access memory. 

13. The article of claim 10 in which the instructions 
further cause a computer to place the read lock memory 
reference request at the end of a read lock miss queue if 
the read lock memory reference request is requesting access 
to a locked memory location. 

14. The article of claim 10 in which the instructions 
further cause a computer to remove the read lock memory 
reference requests from the read lock miss queue in the 
order in which they were placed in the read lock miss 
queue . 
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15. The article of claim 10 in which the read lock 
memory reference request is fetched from a read queue. 

16. The article of claim 10 in which the read lock 
memory reference request is fetched from an order queue. 

5 17. The article of claim 10 in which the read lock 

miss queue contains a number of entries equal to a number 
of contexts that may execute the read lock memory reference 
requests minus one. 

18. The article of claim 10 in which the instructions 
10 further cause a computer to place a read lock memory 

reference request in a queue that contains a read lock 
memory reference request requesting access to the same 
memory location. 

19. A controller for managing access to random access 
15 memory comprises: 

an address content addressable memory that supports 
locked locations in memory; and 

a read lock miss queue that stores read lock memory 
reference requests that request access to memory locations 
20 unlocked in the address content addressable memory when the 
read lock miss queue contains at least one entry. 
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20. The controller of claim 19 in which the random 
access memory is located in a parallel hardware^based 
multithreaded processor. 

21. The controller of claim 19 in which the random 
5 access memory comprises a static random access memory. 

22. The controller of claim 19 in which the read lock 
memory reference requests are stored at the end of the read 
lock miss queue if the read lock memory reference request 
requests access to a locked memory location. 

10 23. The controller of claim 19 in which the read lock 

memory reference requests are removed from the read lock 
miss queue in the order in which they were placed in the 
read lock miss queue. 

24. The controller of claim 19 in which the read lock 
15 miss queue contains a number of entries equal to a number 

of contexts that may execute the read lock memory reference 
requests minus one. 
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